The aim of this study is to predict tomorrow’s hourly consumption using predictive AR and MA models while accounting for seasonality and trend components. The data are acquired from EPİAŞ and cover 1 January 2016 through 20 May 2021. The last 14 days are used for testing the models.
Importing the necessary libraries:
library("readxl")
library("ggplot2")
library("tidyverse")
library("zoo")
library(plotly)
library("data.table")
library(forecast)
library(ggcorrplot)
library(stats)
library(urca)
library(lubridate)
The target variable is obtained as a CSV file from EPİAŞ:
df <- fread("RealTimeConsumption-01012016-20052021.csv")
str(df)
## Classes 'data.table' and 'data.frame': 47208 obs. of 3 variables:
## $ Date : chr "01.01.2016" "01.01.2016" "01.01.2016" "01.01.2016" ...
## $ Hour : chr "00:00" "01:00" "02:00" "03:00" ...
## $ Consumption (MWh): chr "26,277.24" "24,991.82" "23,532.61" "22,464.78" ...
## - attr(*, ".internal.selfref")=<externalptr>
#drop rows with missing values (drop_na returns a copy, so reassign)
df <- drop_na(df)
#converting to data table
setDT(df)
#combining Date and Hour column
df$NewDate <-dmy(df$Date)+hm(df$Hour)
#character to numeric values for consumption series
df$`Consumption (MWh)` <-gsub(',','',df$`Consumption (MWh)`)
df$`Consumption (MWh)`<-as.numeric(df$`Consumption (MWh)`)
df$year<-format(df$NewDate, "%y")
df$day<-format(df$NewDate, "%m/%d")
df$month<-format(df$NewDate, "%m")
There are no missing values.
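This claim can be checked directly. A minimal sketch with an illustrative helper (`is_complete_hourly` is my naming, not from the source) that verifies there are no NA cells and that the row count matches the number of hours spanned:

```r
# Hedged sanity check for a complete hourly table: no NA cells, and the row
# count equals the number of hours between first and last timestamp plus one
is_complete_hourly <- function(tbl, timestamps) {
  sum(is.na(tbl)) == 0 &&
    nrow(tbl) == as.numeric(difftime(max(timestamps), min(timestamps),
                                     units = "hours")) + 1
}
```

In this document’s context it would be called as `is_complete_hourly(df, df$NewDate)`.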
## 1. Possible types of seasonality exhibited by hourly electricity consumption
The visualization of the time series data:
p <- ggplot(data = df, aes(x = NewDate, y = `Consumption (MWh)`,group=1))+
geom_line(color = '#007cc3') + labs(title = 'Hourly Electricity Consumption in Turkey between 2016-2021', x= "Date", y = "Hourly Electricity Consumption(MWh)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)
fig <- fig %>% layout(dragmode = "pan")
fig
From the preliminary analysis, it can be seen that the data exhibit monthly seasonality. For instance, electricity consumption is highest in the summer season, i.e. between June and September. The yearly trend, on the other hand, is stable and linear between 2016 and 2020. Zooming into the monthly level in 2016:
p <- ggplot(data = df[1:8500,], aes(x = NewDate, y = `Consumption (MWh)`,group=1))+
geom_line(color = '#007cc3') + labs(title = 'Hourly Electricity Consumption in 2016', x= "Date", y = "Hourly Electricity Consumption(MWh)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)
fig <- fig %>% layout(dragmode = "pan")
fig
Now the monthly pattern is more observable. Consumption is higher in the summer months, and significant reductions in energy consumption are observable on 27 March, 7 June and 13 September. I couldn’t find any information about 27 March; however, 7 June and 13 September were national holidays in Turkey. From that information, national holidays have a large impact on energy consumption, probably because of the reduction in industrial electricity consumption. Zooming further into the weekly, daily and hourly level:
p <- ggplot(data = df[1:745,], aes(x = NewDate, y = `Consumption (MWh)`,group=1))+
geom_line(color = '#007cc3') + labs(title = 'Hourly Electricity Consumption in January 2016', x= "Date", y = "Hourly Electricity Consumption(MWh)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)
fig <- fig %>% layout(dragmode = "pan")
fig
p <- ggplot(data = df[1:200,], aes(x = NewDate, y = `Consumption (MWh)`,group=1))+
geom_line(color = '#007cc3') + labs(title = 'Hourly Electricity Consumption in 1-8 January 2016', x= "Date", y = "Hourly Electricity Consumption(MWh)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)
fig <- fig %>% layout(dragmode = "pan")
fig
From these two figures, there is also weekly seasonality, i.e. the data pattern repeats every 7 days. Furthermore, consumption peaks between 10:00 and 18:00 every day, so hourly (within-day) seasonality exists too.
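The within-day peak can also be checked numerically. A minimal sketch with an illustrative helper (`hourly_profile` is my naming), assuming the hour strings look like "13:00" as in the `Hour` column above:

```r
# Mean consumption per hour of day; if the visual observation is correct,
# the 10-18 hours should show the highest averages
hourly_profile <- function(consumption, hour_str) {
  tapply(consumption, substr(hour_str, 1, 2), mean)
}
```

In this document’s context: `sort(hourly_profile(df$`Consumption (MWh)`, df$Hour), decreasing = TRUE)`.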
To sum up, hourly, daily, weekly and monthly seasonality all exist in the consumption data and should be taken into account in the predictive models. The data also have a stable, linear yearly trend. To further investigate and eliminate the seasonality and trend components, decomposition of the time series at different levels (hourly, daily, weekly, monthly) is required, because ARIMA models work best when the seasonal components are known and the data are made stationary. Since the variance is not increasing over time, additive decomposition will be used. To check whether the data are stationary, the KPSS Unit Root Test is used:
KPSS_test= ur.kpss(df$`Consumption (MWh)`)
summary(KPSS_test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 18 lags.
##
## Value of test-statistic is: 12.695
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
plot(KPSS_test)
The null hypothesis is that the data are stationary, and the value of the test statistic is far larger than the critical values. Therefore, the null hypothesis is rejected and the data are non-stationary.
df_new <- ts(df$`Consumption (MWh)`,frequency=24)
df_additive <- decompose(df_new,type="additive")
plot(df_additive)
From the figure, seasonality is not clearly observable at this scale, but from the weekly figure at the beginning of the study we know that consumption peaks between 10:00 and 18:00 every day, meaning that it rises in the mornings and falls at night. Therefore hourly seasonality exists.
To analyze daily trends and seasonality, the hourly data are aggregated into daily data by averaging over each 24-hour period.
daily_df <- df %>% group_by(year,day) %>% summarize(DailyAvg = mean(`Consumption (MWh)`))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
df_new <- ts(daily_df$DailyAvg,frequency=7)
df_additive <- decompose(df_new,type="additive")
plot(df_additive)
The trend and seasonality are clearly observable in the figure above. The random component resembles white noise, but large deviations can be observed on national holidays.
To analyze monthly trends and seasonality, the daily data are aggregated into monthly data by averaging over each month.
monthly_df <- df %>% group_by(year,month,day) %>% summarize(DailyAvg = mean(`Consumption (MWh)`))
## `summarise()` has grouped output by 'year', 'month'. You can override using the `.groups` argument.
monthly_df <- monthly_df %>% group_by(year,month) %>% summarize(Monthly = mean(DailyAvg))
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
df_new <- ts(monthly_df$Monthly,frequency=12)
df_additive <- decompose(df_new,type="additive")
plot(df_additive)
For the monthly data, the seasonality and the trend line are very clearly observable.
Supposing that there is a pattern every 168 hours (7 days), the decomposition of the time series is:
df_new <- ts(df$`Consumption (MWh)`,frequency=168)
df_additive <- decompose(df_new,type="additive")
plot(df_additive)
Over a 7-day period, energy consumption is lowest on Sundays and second lowest on Saturdays (as can be seen in the daily interactive figures above). Both the trend and the seasonality exist in the data and should be eliminated for the predictive models.
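The weekday effect described above can be verified with a small summary. A sketch with an illustrative helper (`weekday_profile` is my naming), using the locale-independent ISO weekday number (1 = Monday … 7 = Sunday):

```r
# Average consumption per ISO weekday; Saturday ("6") and Sunday ("7") are
# expected to show the lowest means if the visual observation is correct
weekday_profile <- function(consumption, timestamps) {
  tapply(consumption, format(timestamps, "%u"), mean)
}
```

In this document’s context: `sort(weekday_profile(df$`Consumption (MWh)`, df$NewDate))`.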
For deseasonalizing and detrending the time series, the trend and seasonal components are subtracted from the consumption data:
df_deseasoned_detrended <- df_new-df_additive$seasonal-df_additive$trend
p <- ggplot(data = df_deseasoned_detrended, aes(x = df$NewDate, y = df_deseasoned_detrended,group=1))+
geom_line(color = '#007cc3') + labs(title = 'Deseasoned and Detrended Data', x= "Date", y = "Hourly Electricity Consumption(MWh)") + theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
fig <- fig %>% layout(dragmode = "pan")
fig
The data are now more stationary, with no increasing or decreasing trend. The deviations from the mean result mostly from national holidays. To determine appropriate AR, MA and ARMA models, the KPSS Unit Root Test can be applied to check the stationarity of the data for the differencing (d) parameter. The ACF and PACF plots should also be investigated for the p and q parameters:
KPSS_test= ur.kpss(df_deseasoned_detrended)
summary(KPSS_test)
##
## #######################
## # KPSS Unit Root Test #
## #######################
##
## Test is of type: mu with 18 lags.
##
## Value of test-statistic is: 0.0042
##
## Critical value for a significance level of:
## 10pct 5pct 2.5pct 1pct
## critical values 0.347 0.463 0.574 0.739
The null hypothesis cannot be rejected, since the value of the test statistic is lower than the critical values; hence the data are now stationary.
acf(df_deseasoned_detrended,na.action=na.pass)
pacf(df_deseasoned_detrended,na.action=na.pass)
From the PACF figure, the p parameter should be close to 1, and can plausibly be extended to 6-7, or to 24 (as can be initially guessed from the daily pattern). From the ACF figure, the q parameter should be either 1 or 2. For the training of the models, the test set (the last 14 days) should be excluded from the dataset. Starting with AR models, i.e. models with only an autoregressive term (p):
Train-test set split:
df_train <- df_deseasoned_detrended[1:(length(df_deseasoned_detrended)-14*24)]
df_test <- df_deseasoned_detrended[(length(df_deseasoned_detrended)-14*24+1):length(df_deseasoned_detrended)]
AR Models:
model <- arima(df_train, order=c(1,0,0))
print(model)
##
## Call:
## arima(x = df_train, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.9165 0.2836
## s.e. 0.0018 32.5934
##
## sigma^2 estimated as 345752: log likelihood = -364745.1, aic = 729496.2
AIC(model)
## [1] 729496.2
model <- arima(df_train, order=c(2,0,0))
print(model)
##
## Call:
## arima(x = df_train, order = c(2, 0, 0))
##
## Coefficients:
## ar1 ar2 intercept
## 1.3305 -0.4518 0.2830
## s.e. 0.0041 0.0041 20.0044
##
## sigma^2 estimated as 275194: log likelihood = -359405.7, aic = 718819.4
AIC(model)
## [1] 718819.4
model <- arima(df_train, order=c(3,0,0))
print(model)
##
## Call:
## arima(x = df_train, order = c(3, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 intercept
## 1.3229 -0.4294 -0.0168 0.2831
## s.e. 0.0046 0.0074 0.0046 19.6674
##
## sigma^2 estimated as 275116: log likelihood = -359399.1, aic = 718808.2
AIC(model)
## [1] 718808.2
model <- arima(df_train, order=c(4,0,0))
print(model)
##
## Call:
## arima(x = df_train, order = c(4, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 ar4 intercept
## 1.3223 -0.4431 0.0255 -0.0320 0.2833
## s.e. 0.0046 0.0077 0.0077 0.0046 19.0461
##
## sigma^2 estimated as 274833: log likelihood = -359375, aic = 718762.1
AIC(model)
## [1] 718762.1
model <- arima(df_train, order=c(5,0,0))
print(model)
##
## Call:
## arima(x = df_train, order = c(5, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 intercept
## 1.3225 -0.4432 0.0272 -0.0368 0.0036 0.2839
## s.e. 0.0046 0.0077 0.0079 0.0077 0.0046 19.1183
##
## sigma^2 estimated as 274830: log likelihood = -359374.7, aic = 718763.5
AIC(model)
## [1] 718763.5
model <- arima(df_train, order=c(6,0,0))
print(model)
##
## Call:
## arima(x = df_train, order = c(6, 0, 0))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 intercept
## 1.3226 -0.4445 0.0281 -0.0518 0.0482 -0.0338 0.2839
## s.e. 0.0046 0.0077 0.0079 0.0079 0.0077 0.0046 18.4739
##
## sigma^2 estimated as 274517: log likelihood = -359348.1, aic = 718712.2
AIC(model)
## [1] 718712.2
#model <- arima(df_train, order=c(24,0,0))
#print(model)
#AIC(model)
The lowest AIC among the AR models is achieved by p=24. However, due to its high computational cost, the p=6 model will be used going forward.
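The one-by-one AR fits above can be compacted into a loop. A sketch with an illustrative helper (`ar_aic_table` is my naming) that collects the AIC of AR(p) for a range of p:

```r
# Fit AR(p) for p = 1..max_p and return a named vector of AIC values;
# the order with the lowest AIC is the preferred candidate
ar_aic_table <- function(series, max_p) {
  aics <- sapply(1:max_p, function(p) AIC(arima(series, order = c(p, 0, 0))))
  setNames(aics, paste0("AR(", 1:max_p, ")"))
}
```

In this document’s context: `ar_aic_table(df_train, 6)` (and the same pattern works for MA models with `order = c(0, 0, q)`).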
As mentioned before, from the ACF figure the q parameter should be either 1 or 2 for the MA models. However, increasing the q value further is also worth checking.
model <- arima(df_train, order=c(0,0,1))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 1))
##
## Coefficients:
## ma1 intercept
## 0.8527 0.2825
## s.e. 0.0018 7.5209
##
## sigma^2 estimated as 770951: log likelihood = -383504.5, aic = 767015
AIC(model)
## [1] 767015
model <- arima(df_train, order=c(0,0,2))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 2))
##
## Coefficients:
## ma1 ma2 intercept
## 1.2002 0.6130 0.2827
## s.e. 0.0037 0.0029 8.8227
##
## sigma^2 estimated as 460259: log likelihood = -371437.3, aic = 742882.6
AIC(model)
## [1] 742882.6
model <- arima(df_train, order=c(0,0,3))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 3))
##
## Coefficients:
## ma1 ma2 ma3 intercept
## 1.2839 1.0058 0.4755 0.2827
## s.e. 0.0043 0.0047 0.0035 10.3149
##
## sigma^2 estimated as 351146: log likelihood = -365107.3, aic = 730224.5
AIC(model)
## [1] 730224.5
model <- arima(df_train, order=c(0,0,4))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 4))
##
## Coefficients:
## ma1 ma2 ma3 ma4 intercept
## 1.3153 1.1615 0.7726 0.3117 0.2832
## s.e. 0.0049 0.0066 0.0050 0.0041 11.7820
##
## sigma^2 estimated as 312144: log likelihood = -362353, aic = 724718
AIC(model)
## [1] 724718
model <- arima(df_train, order=c(0,0,5))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 5))
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 intercept
## 1.3274 1.2583 1.0048 0.6145 0.2441 0.2840
## s.e. 0.0045 0.0069 0.0071 0.0061 0.0041 13.5799
##
## sigma^2 estimated as 290538: log likelihood = -360675, aic = 721363.9
AIC(model)
## [1] 721363.9
model <- arima(df_train, order=c(0,0,6))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 6))
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 ma6 intercept
## 1.3326 1.3004 1.0883 0.7385 0.4029 0.1414 0.2843
## s.e. 0.0046 0.0074 0.0086 0.0081 0.0064 0.0043 14.8003
##
## sigma^2 estimated as 284398: log likelihood = -360175.3, aic = 720366.6
AIC(model)
## [1] 720366.6
model <- arima(df_train, order=c(0,0,7))
print(model)
##
## Call:
## arima(x = df_train, order = c(0, 0, 7))
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 ma6 ma7 intercept
## 1.3124 1.2467 1.0475 0.7859 0.5725 0.3799 0.1772 0.2857
## s.e. 0.0049 0.0084 0.0091 0.0078 0.0100 0.0123 0.0075 15.9510
##
## sigma^2 estimated as 279806: log likelihood = -359794.5, aic = 719606.9
AIC(model)
## [1] 719606.9
The AIC decreases only slowly after q=5; therefore q=5 is sufficient for the model when its complexity and speed are also considered.
Comparing the AIC values of the best AR and MA models, the AR models with p=24 and p=6 have lower AIC values compared to the MA model with q=5. To obtain an even better model, let’s start with an ARMA model with p=6 and q=5 (with p=24, my computer cannot train the model due to its high computational cost):
model <- arima(df_train, order=c(6,0,5))
## Warning in log(s2): NaNs produced
print(model)
##
## Call:
## arima(x = df_train, order = c(6, 0, 5))
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 ar6 ma1 ma2 ma3
## 0.6564 0.3405 -0.3684 0.2533 -0.0860 -0.0438 0.6670 0.0990 0.2298
## s.e. 0.1810 0.2569 0.0675 0.1071 0.1311 0.0480 0.1814 0.0613 0.0479
## ma4 ma5 intercept
## -0.0314 -0.0372 0.3594
## s.e. 0.0874 0.0143 18.8209
##
## sigma^2 estimated as 274401: log likelihood = -359338.2, aic = 718702.4
AIC(model)
## [1] 718702.4
BIC(model)
## [1] 718816.2
The AIC value is better (lower) than those of both the best AR and the best MA model. Checking the residuals:
checkresiduals(model)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(6,0,5) with non-zero mean
## Q* = 1640.8, df = 3, p-value < 2.2e-16
##
## Model df: 12. Total lags used: 15
The autocorrelation is not completely eliminated; however, the residuals are approximately normally distributed. To compare the actual and predicted values on the test set, the forecast must be recombined with the seasonal and trend components before visualization. For the seasonal component, I took the first 336 values: since the seasonal pattern repeats every 168 observations, these cover the two-week (336-hour) test horizon. For the trend component, I took the last 336 available trend values, since the most recent trend estimates are the best available proxy for the forecast horizon:
model_forecast <- predict(model, n.ahead = 14*24)$pred
last_trend_value <-tail(df_additive$trend[!is.na(df_additive$trend)],14*24)
seasonality <- df_additive$seasonal[1:(14*24)]
forecast_combined <-model_forecast+last_trend_value+seasonality
df$forecast <- 0
df$forecast[(length(df_deseasoned_detrended)-14*24+1):length(df_deseasoned_detrended)] <-forecast_combined
p <- ggplot()+
geom_line(data = df[46500:47208],aes(x = NewDate,y=`Consumption (MWh)`,color='Actual Data') ) +
geom_line(data = df[(47208-24*14+1):47208],aes(x = NewDate,y=forecast,color='Prediction')) +
scale_color_manual(values = c(
'Actual Data' = '#007cc3',
'Prediction' = 'red')) + labs(title = 'Hourly Electricity Consumption(MWh) Prediction', x= "Date", y = "Hourly Electricity Consumption(MWh)",color = 'Legend') +
theme(text=element_text(color="#004785"),axis.text=element_text(color="#004785"))
fig <- ggplotly(p)
fig <- fig %>% layout(dragmode = "pan")
fig
The prediction represents the within-day changes accurately, but fails to represent the increasing or decreasing trend across days. It is more successful for the first 5 days; after that, the daily trend is falsely predicted by the model. To evaluate the model and measure its overall accuracy, the daily mean absolute percentage error (MAPE) for each date and the weighted mean absolute percentage error (WMAPE) are calculated:
metrics <- df[(47208-24*14+1):47208] %>% group_by(year,day) %>% summarize(MAPE = mean(abs((`Consumption (MWh)`-forecast)/`Consumption (MWh)`)) * 100)
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
metrics
As can be seen above, MAPE is lower especially on the first 3 days.
wmape <- abs(sum((df[(47208-24*14+1):47208]$`Consumption (MWh)`-df[(47208-24*14+1):47208]$forecast)*metrics$MAPE*100))/sum(df[(47208-24*14+1):47208]$`Consumption (MWh)`)
wmape
## [1] 10.19357
WMAPE is approximately 10%, which is understandable given the incorrect predictions, especially after the first 3 days.
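For reference, a commonly used form of WMAPE weights the absolute errors directly by the actual values, which differs from the weighted calculation above. A minimal sketch (`wmape_standard` is my illustrative naming, not from the source):

```r
# Conventional WMAPE: sum of absolute errors over sum of actuals, in percent
wmape_standard <- function(actual, forecast) {
  100 * sum(abs(actual - forecast)) / sum(actual)
}
```

With the same inputs, this definition may yield a somewhat different value than the calculation above.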
In conclusion, the hourly electricity consumption data are highly predictable, and with the right model, hourly, daily or monthly predictions are achievable even with simple ARMA methods. As future work, SARIMA models can be introduced, and the remaining autocorrelation can be eliminated with AR models with a high p parameter.
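A minimal sketch of the SARIMA direction mentioned above, assuming a daily (24-hour) seasonal period; the helper name `fit_sarima_24` and the seasonal orders are illustrative placeholders, not tuned values, and on the full training series such a fit would be computationally heavy:

```r
# Seasonal ARIMA with a 24-hour period via stats::arima; the seasonal order
# passed here should ultimately be selected by AIC, as with the models above
fit_sarima_24 <- function(series, order = c(1, 0, 1),
                          seasonal_order = c(1, 0, 1)) {
  arima(series, order = order,
        seasonal = list(order = seasonal_order, period = 24))
}
```

In this document’s context (slow): `fit_sarima_24(df_train, order = c(6, 0, 5))`.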